Vision-Language Pre-training: Basics, Recent Advances, and Future Trends

@article{Gan2022VisionLanguagePB,
  title={Vision-Language Pre-training: Basics, Recent Advances, and Future Trends},
  author={Zhe Gan and Linjie Li and Chunyuan Li and Lijuan Wang and Zicheng Liu and Jianfeng Gao},
  journal={ArXiv},
  year={2022},
  volume={abs/2210.09263},
  url={https://api.semanticscholar.org/CorpusID:252918286}
}
This paper surveys vision-language pre-training (VLP) methods for multimodal intelligence that have been developed in the last few years, and presents a comprehensive review of state-of-the-art methods.

Bootstrapping Vision-Language Learning with Decoupled Language Pre-training

This work introduces the Prompt-Transformer (P-Former), a model that predicts ideal prompts for aligning with visual features; it is trained exclusively on linguistic data, bypassing the need for image-text pairings.
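
As a rough illustration of that idea, the sketch below maps a caption to a fixed set of prompt vectors using text alone; the module name, dimensions, and pooling scheme are assumptions for illustration, not the paper's actual P-Former design.

import torch
import torch.nn as nn

class PromptPredictor(nn.Module):
    """Predicts a fixed set of prompt vectors from text alone (illustrative sketch)."""
    def __init__(self, vocab_size=30522, dim=768, num_prompts=32):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, dim)
        layer = nn.TransformerEncoderLayer(dim, nhead=8, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=4)
        self.queries = nn.Parameter(torch.randn(num_prompts, dim))

    def forward(self, token_ids):                          # (B, L) token ids
        text = self.encoder(self.embed(token_ids))         # (B, L, D)
        # Pool the text features into num_prompts "ideal prompt" vectors.
        attn = torch.softmax(self.queries @ text.transpose(1, 2), dim=-1)  # (B, P, L)
        return attn @ text                                  # (B, P, D) predicted prompts

Visual features from a vision module could then be aligned to these text-derived prompts during vision-language training, which is how the decoupling described in the summary could play out in practice.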

From Image to Language: A Critical Analysis of Visual Question Answering (VQA) Approaches, Challenges, and Opportunities

This work presents a survey in the domain of VQA that delves into the intricacies of VQA datasets and methods over the field's history, introduces a detailed taxonomy to categorize the facets of VQA, and highlights recent trends, challenges, and scope for improvement.

LAVENDER: Unifying Video-Language Understanding as Masked Language Modeling

Surprisingly, experimental results show that this unified VidL framework, LAVENDER, achieves competitive performance on 14 VidL benchmarks, covering video question answering, text-to-video retrieval, and video captioning.
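
To make the unification concrete, the fragment below sketches how different video-language tasks could all be phrased as filling a [MASK] token; the function and task names are illustrative assumptions, not LAVENDER's actual input format.

# Illustrative sketch: casting several VidL tasks as masked language modeling.
def build_mlm_input(task, video_tokens, text_tokens, mask_id):
    if task == "qa":          # the answer is predicted at the [MASK] position
        return video_tokens + text_tokens + [mask_id]
    if task == "retrieval":   # [MASK] is filled with a match / no-match token
        return video_tokens + text_tokens + [mask_id]
    if task == "captioning":  # the caption is generated by filling masks step by step
        return video_tokens + [mask_id]
    raise ValueError(f"unknown task: {task}")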

A Review of Deep Learning for Video Captioning

This survey covers deep learning-based video captioning (VC), including, but not limited to, attention-based architectures, graph networks, reinforcement learning, adversarial networks, dense video captioning (DVC), and more.

Language Models as Black-Box Optimizers for Vision-Language Models

This work proposes employing chat-based LLMs to search for the best text prompt for VLMs and highlights the advantage of conversational feedback that incorporates both positive and negative prompts, suggesting that LLMs can utilize the implicit gradient direction in textual feedback for a more efficient search.
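
A minimal sketch of such a search loop is shown below, assuming hypothetical chat() and score() helpers (a chat-LLM call and a validation-set evaluator); the paper's actual conversational protocol may differ.

# Illustrative sketch of a chat-LLM-driven prompt search loop.
def optimize_prompt(chat, score, seed_prompts, rounds=10, k=3):
    history = [(p, score(p)) for p in seed_prompts]
    for _ in range(rounds):
        history.sort(key=lambda x: x[1], reverse=True)
        best, worst = history[:k], history[-k:]
        feedback = (
            "These prompts scored well:\n" + "\n".join(p for p, _ in best) +
            "\nThese prompts scored poorly:\n" + "\n".join(p for p, _ in worst) +
            "\nPropose one improved prompt."
        )
        candidate = chat(feedback).strip()   # conversational feedback guides the next guess
        history.append((candidate, score(candidate)))
    return max(history, key=lambda x: x[1])[0]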

Is BERT Blind? Exploring the Effect of Vision-and-Language Pretraining on Visual Language Understanding

It is concluded that exposure to images during pretraining affords inherent visual reasoning knowledge that is reflected in language-only tasks requiring implicit visual reasoning, providing principled guidelines for the choice of text encoders used in such contexts.

Rephrase, Augment, Reason: Visual Grounding of Questions for Vision-Language Models

This work presents Rephrase, Augment and Reason (RepARe), a gradient-free framework that uses the underlying LVLM as a captioner and reasoner to extract salient details about the image and propose modifications to the original question.
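
The sketch below outlines one way such a loop might look, with hypothetical lvlm_caption, lvlm_rephrase, and lvlm_confidence helpers standing in for LVLM calls; RepARe's actual candidate-selection criterion may differ.

# Illustrative sketch of a RepARe-style question-rewriting loop.
def repare(image, question, lvlm_caption, lvlm_rephrase, lvlm_confidence, n=5):
    details = lvlm_caption(image, question)          # salient details about the image
    candidates = [question] + [
        lvlm_rephrase(question, details) for _ in range(n)
    ]
    # Keep the candidate the LVLM itself answers most confidently.
    return max(candidates, key=lambda q: lvlm_confidence(image, q))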

On the Hidden Mystery of OCR in Large Multimodal Models

This work conducts a comprehensive study of existing publicly available multimodal models, evaluating their performance on text recognition, text-based visual question answering, key information extraction, and handwritten mathematical expression recognition.

Visual Instruction Tuning

This paper presents LLaVA (Large Language and Vision Assistant), an end-to-end trained large multimodal model that connects a vision encoder and an LLM for general-purpose visual and language understanding, introduces GPT-4-generated visual instruction tuning data, and makes the model and code base publicly available.
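
At a high level, this family of models projects vision-encoder features into the LLM's embedding space; the module below is a simplified sketch of that pattern, not the released LLaVA code (the dimension defaults and the inputs_embeds interface are assumptions).

import torch
import torch.nn as nn

class VisionLanguageAssistant(nn.Module):
    """Illustrative sketch: project image features into an LLM's token space."""
    def __init__(self, vision_encoder, llm, vision_dim=1024, llm_dim=4096):
        super().__init__()
        self.vision_encoder = vision_encoder       # e.g. a frozen image backbone
        self.projector = nn.Linear(vision_dim, llm_dim)
        self.llm = llm                             # assumed to accept inputs_embeds

    def forward(self, pixels, instruction_embeds):
        image_tokens = self.projector(self.vision_encoder(pixels))  # (B, N, llm_dim)
        # Prepend projected image tokens to the embedded instruction.
        inputs = torch.cat([image_tokens, instruction_embeds], dim=1)
        return self.llm(inputs_embeds=inputs)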

Vision-Text Cross-Modal Fusion for Accurate Video Captioning

This paper introduces a novel end-to-end multimodal video captioning framework based on cross-modal fusion of visual and textual data, which captures visual-textual inter-modal relationships using cross-correlation and encodes the interdependencies between text and video information using attention mechanisms.
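
As a rough illustration of attention-based fusion between the two modalities, the sketch below lets each modality attend to the other; the module name and layer choices are assumptions, not the paper's exact fusion design.

import torch.nn as nn

class CrossModalFusion(nn.Module):
    """Illustrative sketch: bidirectional cross-attention between video and text."""
    def __init__(self, dim=512, heads=8):
        super().__init__()
        self.text_to_video = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.video_to_text = nn.MultiheadAttention(dim, heads, batch_first=True)

    def forward(self, video_feats, text_feats):
        # Each modality attends to the other to encode their interdependencies.
        fused_video, _ = self.text_to_video(video_feats, text_feats, text_feats)
        fused_text, _ = self.video_to_text(text_feats, video_feats, video_feats)
        return fused_video, fused_text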
...